Climate change is not a new problem. Although the earliest research was published in the late \(1800\)s, the topic was not treated as a serious global issue until the \(1960\)s.
As climate change becomes increasingly urgent, data science is being recognized as an essential tool in the fight against it. Data science skills allow us to better understand the complex interplay of environmental factors and to take proactive steps to reduce our impact on the planet.
One of the main human impacts worsening climate change is our carbon footprint, which motivates our project to analyze CO\(2\) emissions. The data is sourced from the International Monetary Fund (IMF) climate change data dashboard. The IMF's mission is to achieve sustainable growth and prosperity for all of its \(190\) member countries by supporting economic policies that promote financial stability and monetary cooperation, which are essential for productivity, job creation, and economic well-being.
With that said, this project analyzes the combination of two data sets: Direct Investment related indicators and Climate related disaster frequency.
Our combined data consists of CO\(2\) emissions (in metric tons) for \(60\) countries spanning \(2005\) to \(2018\), broken down by \(4\) types of enterprise and \(15\) sectors. The four types of enterprise of focus are: exports of domestically controlled enterprises, exports of foreign controlled multinational enterprises, output of domestically controlled enterprises, and output of foreign controlled multinational enterprises.
With \(826\) total observations and \(59\) combined variables, we hope to account for the frequency of disaster events by country and year with the models listed below in the outline.
# calling required packages
library(readr)
library(tidyverse)
library(tidymodels)
library(randomForest)
library(parsnip)
library(workflows)
library(parallel)
library(doParallel)
library(yardstick)
library(ggplot2)
library(xgboost)
library(kknn)
library(poissonreg)
library(car)
library(ggwordcloud)
tidymodels_prefer()
theme_set(theme_bw())
# helper to unregister the foreach parallel backend and prevent lost-connection errors
unregister_dopar <- function() {
env <- foreach:::.foreachGlobals
rm(list=ls(name=env), pos=env)
}
unregister_dopar()
In this project, we utilized two datasets, 11_Direct_Investment_related_indicators.csv and 24_Climate_related_Diasters_Frequency.csv. The former contains CO\(2\) emissions for each country from \(2005\) to \(2018\), broken down by sector and by export versus output of domestically or foreign controlled enterprises. The latter contains disaster counts from \(1980\) to \(2019\) for each country by type of disaster.
#######Combining the two data sets#######
df <- emission_df %>% left_join(disasters_df, by = c("ISO3", "Year"))
df <- df %>% mutate(Year = as.factor(Year))
# remove data not going to be used
rm(ECCD_df, disasters_frequency)
#####More Manipulation to the Dataset#####
# relocating Year to the first
df <- df %>% relocate(Year, .before = Sector_abb)
# replace all na with 0
df <- df %>% mutate_all(~ ifelse(is.na(.), 0, .))
# renaming the branches
df <- df %>% rename(Export_Domestics = ECBIXD,
Export_Foreign = ECBIXF,
Output_Domestic = ECBIOD,
Output_Foreign = ECBIOF,
Disaster_Frequency = ECCD)
# pivot longer the branches
df_long <- df %>% pivot_longer(cols = Export_Domestics:Output_Foreign,
names_to = "Category",
values_to = "Value")
# pivot wider the combinations of branches and category
df_wide <- df_long %>% pivot_wider(names_from = c("Sector_abb", "Category"),
values_from = Value,
values_fill = 0,
names_glue = "{Category}_{Sector_abb}")
# convert year from character to factor
df_wide <- df_wide %>%
mutate(Year = as.factor(Year))
# Writing and saving data as csv
# tidy_wide <- write.csv(df_wide,"Pathname/tidydata.csv", row.names = FALSE)
# tidy_long <- write.csv(df_long,"Pathname/tidydata_long.csv", row.names = FALSE)
# save(df_wide, file = "df_wide.csv")
# save(df_long, file = "df_long.csv")
There are two tidy versions of the data saved as df_long and df_wide: df_long keeps the four enterprise categories of interest in a single column, while df_wide stratifies enterprise type by sector into separate columns.
We decided on an 80-20 split of the data into train and test sets.
set.seed(123457)
# train test split by assigning 80% to train and remaining 20% to test
df_split <- df_wide %>% initial_split(prop = 0.8)
train <- df_split %>% training()
test <- df_split %>% testing()
# dropping country indicator
train_sub <- train %>% select(-c(ISO3))
test_sub <- test %>% select(-c(ISO3))
Examine the effects of domestically versus foreign controlled enterprises' CO\(2\) emissions, relative to the sectors of activity performed in each country, on climate related disasters.
TARGET VARIABLE: Disaster_Frequency
* Disaster by Year
* World Map of CO\(2\) Emissions
* Disaster by Year for Each Country
* Emission versus Disaster Count for Each Country
* CO\(2\) Emissions by Type of Enterprise
Climate change is a global issue linked to higher natural disaster frequency. The line graph below depicts total disaster events by year at the global level. Since the 1900s, natural disasters have been on the rise. Because our data only spans 2005 to 2018, however, the plot shows a period of decline, with 2005 having the most events.
# prep dataframe for plot
df_aggregate_2 <- df_wide %>%
group_by(Year, ISO3) %>%
  reframe(disaster_count = first(Disaster_Frequency)) %>%
  group_by(Year) %>%
  summarize(total_disaster_frequency = sum(disaster_count))
# converting Year to numeric
df_aggregate_2$Year <- as.numeric(as.character(df_aggregate_2$Year))
# ggplot using aggregate dataset
ggplot(df_aggregate_2, aes(x = Year, y = total_disaster_frequency)) +
geom_line(color = "blue") +
geom_point(color = "blue") +
labs(title = "Total Disaster Frequency by Year ",
x = "Year",
y = "Sum Disaster Frequency")
Disasters by year. The total disaster count declines over the observed period. This downward trend reflects the relatively short time span of our data; over a longer time frame, one would expect a general upward trend.
Carbon dioxide is a greenhouse gas that traps heat in the atmosphere, leading to global warming and climate change. It is produced primarily by human activities such as burning fossil fuels for energy, transportation, and other industry sectors. Below is a map of CO\(2\) emissions produced in each country, with red at the higher end of the spectrum and white indicating little CO\(2\) produced.
# load world data for map
library(maps)
world <- map_data("world")
# data prepping for the map
# recode country names to match the region names used in the world map data
Country_label <- Country_label %>%
  mutate(Country = dplyr::recode(Country,
    "China, P.R.: Hong Kong"   = "China",
    "China, P.R.: Mainland"    = "China",
    "Taiwan Province of China" = "Taiwan",
    "Korea, Rep. of"           = "South Korea",
    "Netherlands, The"         = "Netherlands",
    "Poland, Rep. of"          = "Poland",
    "Russian Federation"       = "Russia",
    "Slovak Rep."              = "Slovakia",
    "Slovenia, Rep. of"        = "Slovenia",
    "Estonia, Rep. of"         = "Estonia",
    "Czech Rep."               = "Czech Republic",
    "Croatia, Rep. of"         = "Croatia",
    "United States"            = "USA"))
# create data frame for map
bind <- df_long %>%
group_by(ISO3, Year, Category) %>%
reframe(Value = sum(Value)) %>%
left_join(Country_label, by="ISO3") %>%
select(c(Country,Value)) %>%
group_by(Country) %>% reframe(Value = sum(Value))
# combine the two China values
bind <- bind %>% group_by(Country) %>% summarize(Value = sum(Value))
# join the two datasets
world <- world %>% full_join(bind, by = c("region" ="Country"))
world <- world %>% mutate_all(~ ifelse(is.na(.), 0, .))
This renaming is necessary so that the join with the map data retains all countries.
# plot the map, set emission as fill color.
ggplot(world, aes(long, lat, group=group, fill = log(Value+1))) +
geom_polygon(color="gray") +
scale_fill_gradient2(low = "lightgray", mid = "yellow", high = "red",
midpoint = 10, limits = c(0, 20)) +
coord_fixed() +
ggtitle("Map of CO2 Emission Recorded in Dataset") +
labs(fill='Log of Metric \nTons of CO2 \nEmission')
Map of CO\(2\) emissions recorded in our dataset. Fill is the log of metric tons of CO\(2\) emitted; hues closer to red indicate more emissions.
The map shows that large industrialized economies, such as the United States, China, Russia, and parts of Europe, produce more CO\(2\) emissions than less developed countries.
Having seen which countries produce the most CO\(2\), we have candidates for those with a higher frequency of natural disasters. The graph below counts disasters by year, colored by country, and confirms our suspicion that more disasters occur in these large emitting countries.
The top three countries with the most disasters are the USA, China, and India.
# prepare data for the line graph.
# obtain the 10 countries with the highest disaster count for every year.
df_aggregate_1 <- df_long %>%
group_by(Year, ISO3) %>%
reframe(total_disaster = first(Disaster_Frequency)) %>%
group_by(Year) %>%
slice_max(total_disaster, n = 10)
# convert Year to numeric
df_aggregate_1$Year <- as.numeric(as.character(df_aggregate_1$Year))
# ggplot using aggregate dataset
ggplot(df_aggregate_1, aes(x = Year, y = total_disaster, color = ISO3)) +
geom_line() +
geom_point() +
labs(title = "Count of Disasters by Year for Top 10 Countries with Most Disasters",
x = "Year",
y = "Sum of Disaster",
color = "Country")
Count of disasters by year for the countries with the most disasters. This plot includes the \(10\) countries with the highest disaster count for each year. Some countries, such as China, India, and the USA, rank in the top \(10\) every year. These three countries share one characteristic: their land areas are large compared with other countries.
In the scatterplot below, we explore the countries with the most disasters against the log of CO\(2\) emissions in metric tons. The plot highlights only countries with more than 10 events and is faceted by year. As the facets show, data is sparse for 2017 and 2018. The USA, China, and India are consistently among the top producers of carbon dioxide, correlating with higher disaster counts per year.
# optimize data for scatter plot
scatterplot_df <- df_long %>% group_by(ISO3, Year) %>%
reframe(emission = sum(`Value`),
Disaster_Frequency = first(Disaster_Frequency))
# convert country (ISO3) to factor
scatterplot_df$ISO3 <- as.factor(scatterplot_df$ISO3)
# plot the scatter plot
scatterplot_df %>% ggplot(aes(x = log(emission + 0.0001),
y = Disaster_Frequency)) +
geom_point() +
geom_point(data = scatterplot_df[scatterplot_df$Disaster_Frequency>10, ],
aes(x = log(emission + 0.0001),
y = Disaster_Frequency,
color = ISO3)) +
facet_wrap(~Year) +
ylab("Disaster Count") +
xlab("Log of Emission (Log of Metric Tons of CO2)") +
ggtitle("Emission versus Disaster Count for Each Country")
Emissions versus disaster count for each country. Countries with more than \(10\) disasters per year are highlighted. There appears to be a linear relationship between log emissions and disaster count. Countries such as China and the USA consistently record more than \(10\) disasters every year.
The four types of enterprises are exports of domestic controlled enterprises, exports of foreign controlled multinational enterprises, output of domestic controlled enterprises, and output of foreign controlled multinational enterprises.
Exports are shipments of goods or services from one country to another. Output refers to the production of goods or services within a particular industry or sector. Domestic refers to products or services produced within a particular country and intended for use or consumption within that country. Foreign refers to products or services produced outside one's own country and intended for use or consumption in other countries. Each type of enterprise combines export or output with foreign or domestic. For example, exports of domestically controlled enterprises ('Export_Domestics') are goods produced by companies headquartered in the country from which the goods are exported.
The stacked bar graph below depicts the CO\(2\) emissions produced by each category of enterprise.
# prep the dataframe for bar plot.
df_agg <- df_long %>%
group_by(ISO3, Category) %>%
reframe(emission = sum(`Value`)) %>%
mutate_if(is.character, as.factor) %>%
group_by(Category) %>%
arrange(desc(emission)) %>%
slice(1:10)
# CO2 by top 10 country and fill by category.
ggplot(df_agg,aes(x = ISO3, y = emission, fill = Category)) +
geom_bar(stat = "identity") +
# facet_wrap(~Category)+
labs(
title = "CO2 emissions by Category of Enterprise for Each Country",
x = "Country",
y = "Metric Tons of CO2 Emission"
)
CO\(2\) emissions by category of enterprise for each country. The USA comes first in CO\(2\) emissions, followed by India and China, whose emissions are roughly equal. As a general trend, Output_Domestic is the major contributor to CO\(2\) emissions.
In developed countries, the output of domestically controlled enterprises produces the most CO\(2\) emissions compared with the other categories. Most CO\(2\) therefore comes from goods or services produced by companies owned and controlled within the country where they operate, indicating that countries like the USA produce goods that feed back into their own economies.
* Random Forest with hyper-parameter tuning
* XG-Boosting with hyper-parameter tuning
* Poisson Generalized Linear Regression with Lasso
* Poisson Generalized Linear Regression with Elastic Net
* K-Nearest Neighbor Regression
+ K = 5, 10, and 20
For the Random Forest model, we tune three hyper-parameters: the number of predictors considered at each split (mtry), the number of trees in the ensemble (trees), and the minimum number of data points required to split a node (min_n). Each hyper-parameter is tuned at three levels, for \(3^3 = 27\) possible combinations. A \(10\)-fold cross validation is repeated twice to reduce sampling bias, so \(27 \times 10 \times 2 = 540\) model fits are evaluated to find the best combination. To improve modeling, categorical variables are encoded as dummy variables and all numeric predictors are normalized. The metrics used to evaluate the models are Mean Absolute Error (mae), Root Mean Squared Error (rmse), and R-squared (rsq); the first two measure the deviation of predicted from observed values, while rsq measures the proportion of variance explained.
# set the seed
set.seed(123457)
# define parsnip, set random forest with mtry, trees, and min_n hyperparameters
# as tuning. To be tuned later.
parsnip_RF <- rand_forest(mtry = tune("mtry"),
trees = tune("trees"),
min_n = tune("min_n")) %>%
set_engine('randomForest') %>%
set_mode('regression') #set random forest to regression
# define random forest recipe, code all nominal predictors dummy variables and
# normalize all numeric predictors.
RF_recipe <-
recipe(formula = Disaster_Frequency ~ ., data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
# define workflow
workflow_RF <- workflow() %>%
add_model(parsnip_RF) %>%
add_recipe(RF_recipe)
# define hyperparameter tuning grid to be experimented
hyperparam_tune_grid <- crossing(min_n = seq(10, 50, by = 20),
mtry = seq(10, 50, by = 20),
trees = c(100, 500, 1000))
# define cross validation
rf_cv <- train_sub %>% vfold_cv(v = 10, repeats = 2)
# define metrics to be used. Note that we define the rsq within the tidymodels
# context. In this case, rsq is the measure of correlation rather than an error
# quantification measure.
RF_metrics <- metric_set(yardstick::rmse, mae, rsq)
The following performs parallel processing for the hyper-parameter tuning and cross validation.
# assigning resources
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
time1 <- Sys.time() #save current system time
#run the 10 fold cross validation for each combination of hyperparameters
rf_tuning <- workflow_RF %>%
tune_grid(resamples = rf_cv,
grid = hyperparam_tune_grid,
metrics = RF_metrics) %>%
collect_metrics()
time2 <- Sys.time() #save current system time
(diff <- time2 - time1) #obtain total run time
stopCluster(cl)
save(rf_tuning, file = "RandomForest_Prediction.rds")
The following retrieves the results of the parallel run. Both the mean absolute error and root mean squared error suggest that the hyper-parameter combination mtry \(= 30\), trees \(= 100\), and min_n \(= 10\) yields the best model, while R-squared favors mtry \(= 10\), trees \(= 500\), and min_n \(= 10\). We proceed with the former combination.
RF_prediction <- load("Scripts/RandomForest_Prediction.rds")
# selecting the best hyperparameter combination
Best_result1 <- rf_tuning %>%
filter(!(.metric == "rsq")) %>%
group_by(.metric) %>%
slice_min(mean)
Best_result2 <- rf_tuning %>%
filter(.metric == "rsq") %>%
group_by(.metric) %>%
slice_max(mean)
Best_result3 <- Best_result1 %>% bind_rows(Best_result2)
Best_result3
Fitting the model with the hyper-parameter combination above, the model explains roughly \(70\%\) of the variation in the response. The mean of squared residuals (mse) is \(8.411\), corresponding to an rmse of \(\sqrt{8.411} \approx 2.90\); that is, fitted values deviate from observed values by about \(2.9\) on average.
# Fitting the train data with the best hyperparameter combination.
# Both mae and rmse support mtry = 30, trees = 100, and min_n = 10,
# so we fit that combination to the train set.
parsnip_RF_train <- rand_forest(mtry = 30,
trees = 100,
min_n = 10) %>%
set_engine('randomForest') %>%
set_mode('regression')
# define workflow
workflow_RF_train <- workflow() %>%
add_model(parsnip_RF_train) %>%
add_recipe(RF_recipe)
# fit to the train data
RF_fit <- workflow_RF_train %>% fit(data = train_sub)
# define function to calculate mae
calculate_mae <- function(actual, predicted){
  mean(abs(actual - predicted))
}
# define function to calculate rmse
calculate_rmse <- function(actual, predicted){
  sqrt(mean((actual - predicted)^2))
}
# display result
RF_rsq <- RF_fit$fit$fit$fit$rsq %>% mean()
RF_mae <- calculate_mae(train_sub$Disaster_Frequency, RF_fit$fit$fit$fit$predicted)
RF_rmse <- calculate_rmse(train_sub$Disaster_Frequency, RF_fit$fit$fit$fit$predicted)
cat("Metrics of Random Forest: rsq =", RF_rsq, ", mae =", RF_mae,
", and rmse=", RF_rmse, ".")
## Metrics of Random Forest: rsq = 0.6973855 , mae = 1.717929 , and rmse= 2.900246 .
Predicting on both the train and test sets, we conclude that the model's performance is fair. Prediction on the train set scored an rsq of \(0.837\), an rmse of \(2.183\), and an mae of \(1.059\). Performance on the test set is lower, with an rsq of roughly \(0.765\); its mae and rmse are only slightly larger than on the train set.
# prediction on the train set
RF_train_predict <- RF_fit %>% predict(train_sub)
RF_train_predict_metric <- RF_train_predict %>%
mutate(truth = train_sub$Disaster_Frequency, estimate = .pred) %>%
RF_metrics(truth = truth, estimate = .pred)
RF_train_predict_metric
# predict the test set and evaluate
RF_test_predict <- RF_fit %>% predict(test_sub)
RF_test_predict_metric <- RF_test_predict %>%
mutate(truth =test_sub$Disaster_Frequency, estimate = .pred) %>%
RF_metrics(truth = truth, estimate = .pred)
RF_test_predict_metric
For the boosting model, we again tune hyper-parameters to improve performance: the depth of each tree (tree_depth), the learning rate of XG-Boost (learn_rate), and the minimum number of data points at each split (min_n). Three levels of each hyper-parameter are tried, giving \(3^3 = 27\) possible combinations. A \(10\)-fold cross validation is repeated twice to reduce sampling bias, so \(540\) models are evaluated in total. Furthermore, to improve model performance, all nominal predictors are coded as dummy variables and all numeric predictors are normalized.
# set the seed
set.seed(123457)
# define parsnip, set boosted trees with tree_depth, learn_rate, and min_n
# hyperparameters to be tuned later.
parsnip_boost <- boost_tree(tree_depth = tune("tree_depth"),
learn_rate = tune("learn_rate"),
min_n = tune("min_n")) %>%
set_engine('xgboost') %>%
set_mode('regression')
# define recipe for boosting, code nominal predictors as dummy variables
# and normalize all numeric predictors.
boost_recipe <-
recipe(formula = Disaster_Frequency ~ ., data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
# define workflow
workflow_boost <- workflow() %>%
add_model(parsnip_boost) %>%
add_recipe(boost_recipe)
# define hyperparameter tuning grid to be experimented
hyperparam_tune_grid_boost <- crossing(min_n = seq(10, 50, by = 20),
tree_depth = c(5, 10, 15),
learn_rate = c(0.01, 0.05, 0.1))
# define cross validation
boost_cv <- train_sub %>% vfold_cv(v = 10, repeats = 2)
# define metrics to be used
boost_metrics <- metric_set(rmse, mae, rsq)
The following utilizes parallel processing to evaluate the models.
# assigning resources
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
time1 <- Sys.time() #save current system time
#run the 10 fold cross validation for each combination of hyperparameters
boost_tuning <- workflow_boost %>%
tune_grid(resamples = boost_cv,
grid = hyperparam_tune_grid_boost,
metrics = boost_metrics) %>%
collect_metrics()
time2 <- Sys.time() #save current system time
(diff <- time2 - time1) #obtain total run time
stopCluster(cl)
save(boost_tuning, file = "XGBoost_Prediction.rds")
After retrieving the results from the parallel run, all three metrics (rsq, mae, and rmse) support the hyper-parameter combination min_n \(= 10\), tree_depth \(= 15\), and learn_rate \(= 0.1\) as yielding the best model.
Boost_prediction <- load("Scripts/XGBoost_Prediction.rds")
Best_result4 <- boost_tuning %>%
filter(!(.metric == "rsq")) %>%
group_by(.metric) %>%
slice_min(mean)
Best_result5 <- boost_tuning %>%
filter(.metric == "rsq") %>%
group_by(.metric) %>%
slice_max(mean)
Best_result6 <- Best_result4 %>% bind_rows(Best_result5)
Best_result6
# Fitting the train data with best hyperparameter combination
# all metrics support the combination of min_n = 10,
# tree_depth = 10, and learn_rate = 0.1, we use it as our model for train set.
parsnip_Boost_train <- boost_tree(tree_depth = 10,
learn_rate = 0.1,
min_n = 10) %>%
set_engine('xgboost') %>%
set_mode('regression')
# define workflow
workflow_Boost_train <- workflow() %>%
add_model(parsnip_Boost_train) %>%
add_recipe(boost_recipe)
# fit to the train data
boost_fit <- workflow_Boost_train %>% fit(data = train_sub)
boost_fit$fit$fit$fit$evaluation_log %>% slice_min(training_rmse)
We see that boosting's performance on the train set is roughly the same as Random Forest's, though slightly lower; the same holds for its performance on the test set.
# predict the train set and evaluate
boost_train_predict <- boost_fit %>% predict(train_sub)
boost_train_predict_metrics <- boost_train_predict %>%
mutate(truth = train_sub$Disaster_Frequency, estimate = .pred) %>%
boost_metrics(truth = truth, estimate = .pred)
boost_train_predict_metrics
# predict the test set and evaluate
boost_test_predict <- boost_fit %>% predict(test_sub)
boost_val <- boost_test_predict %>%
mutate(truth = test_sub$Disaster_Frequency, estimate = .pred) %>%
boost_metrics(truth = truth, estimate = .pred)
boost_val
The Poisson generalized linear regression model is commonly used for count data. Here the outcome of interest is disaster frequency, with the \(58\) combinations of sector and category of enterprise as the predictors.
The lasso penalty regularizes the model by shrinking the coefficients of the predictor variables toward zero, effectively eliminating some of the less important predictors.
set.seed(123457)
# subsetting train dataset, excluding ISO3
train_sub <- train %>% select(-c(ISO3))
# Poisson Generalized Linear Regression Model with Lasso Penalty
poi_recipe <- recipe(Disaster_Frequency ~., data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
# defining model using poisson_reg(). Note: the "glm" engine fits an
# unpenalized Poisson GLM and ignores the penalty argument; a true lasso
# would require set_engine("glmnet") with a numeric penalty and mixture = 1.
poi_parsnip <- poisson_reg(penalty = "lasso") %>%
set_mode("regression") %>%
set_engine("glm")
# defining workflow with model and recipe
poi_workflow <- workflow() %>%
add_model(poi_parsnip) %>%
add_recipe(poi_recipe)
# Cross validation on train dataset
poi_result <- poi_workflow %>%
fit_resamples(
resamples = vfold_cv(train_sub, v = 10),
metrics = metric_set(rmse, rsq, mae)
)
Metrics for the training data below show unfavorable results, with mae and rmse values in the seven or eight figures. The rsq value indicates the model accounts for only about \(21.7\%\) of the variance.
# showing results
poi_result %>% collect_metrics()
poi_fitted <- poi_workflow %>% fit(train_sub)
poi_fitted
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: poisson_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_dummy()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::poisson, data = data)
##
## Coefficients:
## (Intercept) Export_Domestics_Agriculture
## 0.748860 -0.309675
## Export_Foreign_Agriculture Output_Domestic_Agriculture
## 0.229349 0.934654
## Output_Foreign_Agriculture Export_Domestics_Construction
## -0.579262 -0.061464
## Export_Foreign_Construction Output_Domestic_Construction
## 0.095968 -0.523704
## Output_Foreign_Construction Export_Domestics_Education
## -0.005035 -0.267180
## Export_Foreign_Education Output_Domestic_Education
## 0.254784 -0.763702
## Output_Foreign_Education Export_Domestics_Electronic
## -0.083098 0.676342
## Export_Foreign_Electronic Output_Domestic_Electronic
## -1.157403 -0.108360
## Output_Foreign_Electronic Export_Domestics_Energy
## 1.438619 -0.213684
## Export_Foreign_Energy Output_Domestic_Energy
## -0.059051 1.106137
## Output_Foreign_Energy Export_Domestics_Food
## 0.238033 -0.506095
## Export_Foreign_Food Output_Domestic_Food
## -0.242648 1.645598
## Output_Foreign_Food Export_Domestics_HumanActivities
## 1.558337 1.194722
## Export_Foreign_HumanActivities Output_Domestic_HumanActivities
## -0.104619 -1.471897
## Output_Foreign_HumanActivities Export_Domestics_Machinery
## -0.182742 -1.074674
## Export_Foreign_Machinery Output_Domestic_Machinery
## 0.838602 0.906224
## Output_Foreign_Machinery Export_Domestics_Manufacturing
## -0.906660 1.110798
## Export_Foreign_Manufacturing Output_Domestic_Manufacturing
## -1.072520 -0.647625
## Output_Foreign_Manufacturing Export_Domestics_Mining
## 1.381830 0.085472
## Export_Foreign_Mining Output_Domestic_Mining
## 0.331654 0.067772
## Output_Foreign_Mining Export_Domestics_Pharma
## -0.358383 -0.916283
## Export_Foreign_Pharma Output_Domestic_Pharma
## -0.508127 -1.269727
## Output_Foreign_Pharma Export_Domestics_RawMaterial
## 0.323194 -0.404092
##
## ...
## and 28 more lines.
goal_metrics <- metric_set(rmse, rsq, mae)
# predict on test test
poi_prediction_1 <- poi_fitted %>% predict(test_sub) %>%
mutate(truth = test$Disaster_Frequency, estimate = .pred) %>%
goal_metrics(truth = truth, estimate = .pred)
poi_prediction_1
Poisson regression with an elastic net combines \(2\) penalties (L1 and L2), which can select the most important predictors and help avoid overfitting, even in the presence of multicollinearity or high-dimensional data.
# Poisson Generalized Linear Model with Elastic Net Penalty
set.seed(123457)
# defining recipe
poi_recipe_enp <- recipe(Disaster_Frequency ~., data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
# defining model with an elastic net penalty. Note: as above, the "glm"
# engine ignores the penalty; a true elastic net would require
# set_engine("glmnet") with a numeric penalty and 0 < mixture < 1.
poi_parsnip_enp <- poisson_reg(penalty = "elastic_net") %>%
set_mode("regression") %>%
set_engine("glm")
# defining workflow with model and recipe
poi_workflow_enp <- workflow() %>%
add_model(poi_parsnip_enp) %>%
add_recipe(poi_recipe_enp)
# cross validation on train dataset
poi_result_enp <- poi_workflow_enp %>%
fit_resamples(
resamples = vfold_cv(train_sub, v = 10),
metrics = metric_set(rmse, rsq, mae),
)
Metrics for the Poisson model with elastic net below show unfavorable results, with mae and rmse values in the five or six figures. The rsq value indicates only about \(26.8\%\) of the variance is accounted for. Although this is slightly better than the Poisson with lasso, it is still far worse than the XG-Boost and Random Forest models.
# showing results
poi_result_enp %>% collect_metrics()
#fitting model on train
poi_fitted_enp <- poi_workflow_enp %>% fit(train_sub)
poi_fitted_enp
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: poisson_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_dummy()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::poisson, data = data)
##
## Coefficients:
## (Intercept) Export_Domestics_Agriculture
## 0.748860 -0.309675
## Export_Foreign_Agriculture Output_Domestic_Agriculture
## 0.229349 0.934654
## Output_Foreign_Agriculture Export_Domestics_Construction
## -0.579262 -0.061464
## Export_Foreign_Construction Output_Domestic_Construction
## 0.095968 -0.523704
## Output_Foreign_Construction Export_Domestics_Education
## -0.005035 -0.267180
## Export_Foreign_Education Output_Domestic_Education
## 0.254784 -0.763702
## Output_Foreign_Education Export_Domestics_Electronic
## -0.083098 0.676342
## Export_Foreign_Electronic Output_Domestic_Electronic
## -1.157403 -0.108360
## Output_Foreign_Electronic Export_Domestics_Energy
## 1.438619 -0.213684
## Export_Foreign_Energy Output_Domestic_Energy
## -0.059051 1.106137
## Output_Foreign_Energy Export_Domestics_Food
## 0.238033 -0.506095
## Export_Foreign_Food Output_Domestic_Food
## -0.242648 1.645598
## Output_Foreign_Food Export_Domestics_HumanActivities
## 1.558337 1.194722
## Export_Foreign_HumanActivities Output_Domestic_HumanActivities
## -0.104619 -1.471897
## Output_Foreign_HumanActivities Export_Domestics_Machinery
## -0.182742 -1.074674
## Export_Foreign_Machinery Output_Domestic_Machinery
## 0.838602 0.906224
## Output_Foreign_Machinery Export_Domestics_Manufacturing
## -0.906660 1.110798
## Export_Foreign_Manufacturing Output_Domestic_Manufacturing
## -1.072520 -0.647625
## Output_Foreign_Manufacturing Export_Domestics_Mining
## 1.381830 0.085472
## Export_Foreign_Mining Output_Domestic_Mining
## 0.331654 0.067772
## Output_Foreign_Mining Export_Domestics_Pharma
## -0.358383 -0.916283
## Export_Foreign_Pharma Output_Domestic_Pharma
## -0.508127 -1.269727
## Output_Foreign_Pharma Export_Domestics_RawMaterial
## 0.323194 -0.404092
##
## ...
## and 28 more lines.
# poisson with lasso, prediction on test set
goal_metrics <- metric_set(rmse, rsq, mae)
# predict on the test set
poi_prediction_1 <- poi_fitted %>% predict(test_sub) %>%
  mutate(truth = test_sub$Disaster_Frequency) %>%
  goal_metrics(truth = truth, estimate = .pred)
poi_prediction_1
# poisson with elastic net, prediction on test set
goal_metrics <- metric_set(rmse, rsq, mae)
# predict on the test set
poi_prediction_2 <- poi_fitted_enp %>% predict(test_sub) %>%
  mutate(truth = test_sub$Disaster_Frequency) %>%
  goal_metrics(truth = truth, estimate = .pred)
poi_prediction_2
poisson_val <- bind_rows(poi_prediction_1, poi_prediction_2, .id = "Model") %>%
  mutate(Model = dplyr::recode(Model,
                               "1" = "Poisson Lasso",
                               "2" = "Poisson Elastic Net")) %>%
  pivot_wider(names_from = .metric,
              values_from = .estimate)
poisson_val
The metrics table above shows that Poisson regression performs very poorly on held-out data. The metric values are also identical regardless of penalty (lasso or elastic net), suggesting that the penalty term has little influence here.
At this point, random forest is in the lead as the best model for predicting disaster frequency. Perhaps k-nearest neighbors will yield more promising results.
K-nearest neighbors is a simple yet effective algorithm, but its performance depends on the choice of the hyperparameter k, and it can be computationally expensive depending on the size of the data. Here we have chosen k to be 5, 10, and 20; we expect that more neighbors will yield a better model. The metrics used to evaluate the models are Mean Absolute Error (mae), Root Mean Squared Error (rmse), and R squared (rsq); the first two measure the deviation of predicted values from observed values, and the third measures the proportion of variance explained.
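Before fitting, it may help to see what k-NN regression computes under the hood: the prediction at a new point is simply the average response of its k nearest training points. A hand-rolled sketch on made-up numbers (separate from the tidymodels pipeline used in the project):

```r
# Minimal k-NN regression by hand: predict the mean response of the
# k nearest training points (1-D Euclidean distance). Toy data only.
knn_predict <- function(x_train, y_train, x_new, k) {
  d <- abs(x_train - x_new)          # distance to every training point
  nearest <- order(d)[seq_len(k)]    # indices of the k closest points
  mean(y_train[nearest])             # average their responses
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(1, 2, 3, 20, 21, 22)
knn_predict(x, y, x_new = 2, k = 3)    # averages y of x = 1, 2, 3 -> 2
knn_predict(x, y, x_new = 11, k = 3)   # averages y of x = 10, 11, 12 -> 21
```

With small k the fit chases individual points (low bias, high variance); larger k averages over more neighbors and smooths the prediction, which is why the choice of k matters.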
# Set seed
set.seed(1111)
# k=5 (note: k is set via nearest_neighbor(); passing `neighbors` through
# set_engine() appears to be silently ignored by the kknn engine)
knn_parsnip_5 <- nearest_neighbor(neighbors = 5) %>%
  set_mode("regression") %>%
  set_engine("kknn")
knn_recipe_5 <- recipe(Disaster_Frequency ~ .,
data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
knn_workflow_5 <- workflow() %>%
add_model(knn_parsnip_5) %>%
add_recipe(knn_recipe_5)
knn_5 <- knn_workflow_5 %>% fit(train_sub)
knn_5_train_pred <- knn_5 %>% predict(train_sub)
# k=10
knn_parsnip_10 <- nearest_neighbor(neighbors = 10) %>%
  set_mode("regression") %>%
  set_engine("kknn")
knn_recipe_10 <- recipe(Disaster_Frequency ~ .,
data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
knn_workflow_10 <- workflow() %>%
add_model(knn_parsnip_10) %>%
add_recipe(knn_recipe_10)
knn_10 <- knn_workflow_10 %>% fit(train_sub)
knn_10_train_pred <- knn_10 %>% predict(train_sub)
# k=20
knn_parsnip_20 <- nearest_neighbor(neighbors = 20) %>%
  set_mode("regression") %>%
  set_engine("kknn")
knn_recipe_20 <- recipe(Disaster_Frequency ~ .,
data = train_sub) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
knn_workflow_20 <- workflow() %>%
add_model(knn_parsnip_20) %>%
add_recipe(knn_recipe_20)
knn_20 <- knn_workflow_20 %>% fit(train_sub)
knn_20_train_pred <- knn_20 %>% predict(train_sub)
# validation via 10-fold cross-validation on the training set
# k=5
knn_val1<- knn_workflow_5 %>%
fit_resamples(
resamples = vfold_cv(train_sub, v = 10),
metrics = metric_set(rmse, rsq, mae),
)
knn_val1 %>% collect_metrics()
# k=10
knn_val2 <- knn_workflow_10 %>%
fit_resamples(
resamples = vfold_cv(train_sub, v = 10),
metrics = metric_set(rmse, rsq, mae),
)
knn_val2 %>% collect_metrics()
# k=20
knn_val3 <- knn_workflow_20 %>%
fit_resamples(
resamples = vfold_cv(train_sub, v = 10),
metrics = metric_set(rmse, rsq, mae),
)
knn_val3 %>% collect_metrics()
#
metrics_knn_held <- bind_rows(
knn_val1 %>% collect_metrics(),
knn_val2 %>% collect_metrics(),
knn_val3 %>% collect_metrics(),
.id = "Model"
)
metrics_knn_held
Since MAE and RMSE measure prediction error, we hope for low values. The chosen hyperparameters produced little difference between metrics: MAE and RMSE were nearly identical across models, at roughly \(2.6\) and \(4.3\). The R squared is \(0.37\) at \(k=5\) and \(0.40\) at \(k=10\); this roughly \(3\) percentage-point increase continues from \(k=10\) to \(k=20\), making \(k=20\) the best hyperparameter with an R squared of \(0.43\).
knn_5t <- knn_5 %>% predict(test_sub)
knn_10t <- knn_10 %>% predict(test_sub)
knn_20t <- knn_20 %>% predict(test_sub)
rsq_manual <- function (x, y){
cor(x, y) ^ 2
}
knn_5t_rsq <- rsq_manual(test_sub$Disaster_Frequency, knn_5t$.pred)
knn10t_rsq <- rsq_manual(test_sub$Disaster_Frequency, knn_10t$.pred)
knn20t_rsq <- rsq_manual(test_sub$Disaster_Frequency, knn_20t$.pred)
knn_5t_mae <- calculate_mae(test_sub$Disaster_Frequency, knn_5t$.pred)
knn_10t_mae <- calculate_mae(test_sub$Disaster_Frequency, knn_10t$.pred)
knn_20t_mae <- calculate_mae(test_sub$Disaster_Frequency, knn_20t$.pred)
knn_5t_rmse <- calculate_rmse(test_sub$Disaster_Frequency, knn_5t$.pred)
knn_10t_rmse <- calculate_rmse(test_sub$Disaster_Frequency, knn_10t$.pred)
knn_20t_rmse <- calculate_rmse(test_sub$Disaster_Frequency, knn_20t$.pred)
knn_held_metrics <- data.frame(models = c("knn5", "knn10", "knn20"),
rsq = c(knn_5t_rsq, knn10t_rsq, knn20t_rsq),
mae = c(knn_5t_mae, knn_10t_mae, knn_20t_mae),
rmse = c(knn_5t_rmse, knn_10t_rmse, knn_20t_rmse)) %>%
pivot_longer(cols = rsq:rmse)
knn_held_metrics
All three k-nearest neighbor models generated the same predictions, and therefore the same held-out metrics, with an MAE of \(2.45\) and an RMSE of \(3.94\); this is consistent with the neighbors argument being ignored when passed through set_engine() rather than nearest_neighbor(). The model accounts for only \(47\%\) of the variability in disaster frequency.
From the diagnostic plots, we can see that our model's predictions agree well with the observed values. However, the dataset appears to contain quite a few outliers, which may have a significant effect on our model.
# Diagnostic Plots
# setting up plot environment
par(mfrow = c(2,2), mar = c(4,4,5,4))
# Scatter plot of Actual versus predicted
plot(train_sub$Disaster_Frequency, RF_train_predict$.pred,
main = "Actual versus Predicted",
ylab = "Prediction Result",
xlab = "Actual Result")
# calculate residuals
resid <- train_sub$Disaster_Frequency - RF_train_predict$.pred
# fitted versus residual plot
fitted_resid_plot <- plot(RF_train_predict$.pred, resid,
ylab = "Residuals",
xlab = "Fitted Values",
main = "Fitted versus Residuals")
abline(h = 0, col = 'blue', lty = 2)
# calculate standard residuals
stand_resid <- (resid - mean(resid))/sd(resid)
# fitted versus standard residual plot
plot(RF_train_predict$.pred, stand_resid,
ylab = "Standard Residuals",
xlab = "Fitted Values",
main = "Fitted versus Standard Residuals")
abline(h = 0, col = 'blue', lty = 2)
# qq plot (qqPlot() comes from the car package)
qqPlot(stand_resid, main = "QQ-norm Plot",
id = FALSE,
ylab = "Standard Residuals",
xlab = "Norm Quantiles")
# setting a common title
mtext("Diagnostic Plots", side = 3, line = -2, cex = 1.5, outer = TRUE)
Diagnostic plots of the random forest model. We see a roughly linear trend in the actual versus predicted plot (top left) with some outliers, implying that the majority of our predictions agree with the observed values. The fitted versus residuals plot (top right) shows no specific pattern, although there appear to be quite a few outliers; the same occurred in the fitted versus standard residuals plot (bottom left). Finally, the qq-plot appears heavy-tailed.
# pull the variable importance from model
importance_df <- importance(RF_fit$fit$fit$fit) %>% data.frame()
# obtain the top 10 important variables
importance_df_top10 <- importance_df %>% rownames_to_column() %>%
tibble() %>%
arrange(desc(IncNodePurity)) %>%
slice_max(order_by = IncNodePurity, n = 10) %>%
mutate_if(is.numeric, round, 2)
# print the result
importance_df_top10
# plot variable importance (varImpPlot() shows impurity-based importance,
# not partial dependence)
varImpPlot(RF_fit$fit$fit$fit,
           main = "Variable Importance Plot",
           sub = "Top 10 most important variables")
Variable importance plot. Identifies the most influential predictors in the dataset.
According to the variable importance plot, the variable Output_Domestic_Construction has the strongest effect on the target variable, disaster frequency, followed by Output_Domestic_Agriculture, Output_Domestic_Manufacturing, and Output_Domestic_Food. The remaining variables show a much weaker effect on the target.
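Since varImpPlot() ranks predictors rather than tracing how predictions change as one predictor varies, a true partial dependence profile could be drawn with the pdp package. This is a sketch on built-in data; with the project model one would presumably pass RF_fit$fit$fit$fit and a top predictor such as Output_Domestic_Construction as pred.var:

```r
library(randomForest)
library(pdp)

# Small random forest on built-in data, standing in for the project model
set.seed(1111)
rf_toy <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# Partial dependence: average model prediction as one predictor varies,
# with the remaining predictors held at their observed values
pd <- partial(rf_toy, pred.var = "wt", train = mtcars)
plotPartial(pd)
```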
The five add-ons we have chosen are listed below.
From a quick investigation of the variables using Principal Component Analysis, we see that the first four principal components explain roughly \(74.789\%\) of the variance in the dataset. A closer look at the first four principal components reveals that only the first is interpretable: in it, all factors load negatively except year, which loads positively.
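The percentage of variance explained that broom::tidy(..., matrix = "pcs") reports below is just each component's squared standard deviation over the total. A quick sanity check on built-in data (a sketch, not the project data):

```r
# Variance explained by each principal component = sdev^2 / sum(sdev^2);
# these shares are decreasing and sum to 1.
pca <- prcomp(USArrests, scale = TRUE)
percent <- pca$sdev^2 / sum(pca$sdev^2)
round(percent, 3)
# broom::tidy(pca, matrix = "pcs") reports the same values in its
# `percent` column, accumulated in `cumulative`.
```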
# mutate all predictors to numeric (only Year is a factor/non-numeric;
# every other predictor is numeric at this point)
# perform PCA with scaling
train_PCA <- train_sub %>%
mutate_if(~ any(!is.na(as.numeric(.))), ~as.numeric(.)) %>%
prcomp(scale = TRUE)
train_loadings <- broom::tidy(train_PCA, matrix = "loadings")
train_scores <- broom::tidy(train_PCA, matrix = "scores")
train_variances <- broom::tidy(train_PCA, matrix = "pcs")
train_variances %>% ggplot(aes(x = PC, y = percent)) +
geom_segment(aes(xend = PC), yend = 0, linewidth = 5) +
ylab("Percentage of Variance Explained") +
ggtitle("Percentage of Variance Explained by Each Principal Component")
Percentage of variance explained by each principal component. The first few principal components account for the majority of the variance.
cat("Percentage of variance explained by the first four principal components is",
    (train_variances$percent[1:4] %>% sum())*100, "%")
## Percentage of variance explained by the first four principal components is 74.789 %
# looking at the first two principle components
train_loadings %>%
filter(PC <=2) %>%
ggplot(aes(y = column, x = value)) +
geom_segment(aes(yend = column), xend = 0, linewidth = 4) +
facet_wrap(~PC) +
theme(axis.text.y=element_text(size=rel(0.75)))
Factor loadings for the first two principal components. For the first component, all factors load negatively while only year loads positively. The second component does not seem to be interpretable.
# looking at the third and fourth principle components
train_loadings %>%
filter(PC > 2 & PC <= 4) %>%
ggplot(aes(y = column, x = value)) +
geom_segment(aes(yend = column), xend = 0, linewidth = 4) +
facet_wrap(~PC) +
theme(axis.text.y=element_text(size=rel(0.75)))
Factor loadings for the third and fourth principal components. Neither loading plot appears to be interpretable.
# prepping dataframe for the word cloud, dropping years 2017 and 2018 since they
# do not contain any emission values.
word_cloud_df <- df_long %>% select(-ISO3, -Category, -Disaster_Frequency) %>%
group_by(Year, Sector_abb) %>%
reframe(emission = sum(Value)) %>%
filter(!(Year == '2017' | Year == '2018'))
# plot the word cloud, emission as sizing value and facet wrap by year
word_cloud_df %>% ggplot() +
geom_text_wordcloud(aes(label = Sector_abb, size = emission)) +
scale_size_area(max_size = 2.8) +
facet_wrap(~Year)
The calculate_mae() and calculate_rmse() helper functions used above are defined in the random forest model fitting chunk.
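Those helpers presumably reduce to the standard formulas; a sketch of what calculate_mae() and calculate_rmse() likely look like (not the project's actual definitions, which live in the random forest chunk):

```r
# Presumed definitions of the error helpers (sketch): both take a vector
# of observed values and a vector of predictions.
calculate_mae <- function(actual, predicted) {
  mean(abs(actual - predicted))        # mean absolute error
}
calculate_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))   # root mean squared error
}

calculate_mae(c(1, 2, 3), c(2, 2, 2))    # -> 0.667 (approx.)
calculate_rmse(c(1, 2, 3), c(2, 2, 2))   # -> 0.816 (approx.)
```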
The metrics used to evaluate the models are Mean Absolute Error (mae), Root Mean Squared Error (rmse), and R squared (rsq), each a measure of the deviation of predicted values from observed values. Symmetric Mean Absolute Percentage Error (sMAPE) is added to measure the relative accuracy of a forecast.
####Note: conflicts with knn code above, subject to edits.
##########Must run knn_prediction_code.R before running this R script file#########
# Out of sample metric: sMAPE
# sMAPE = (1/n) * Σ ( |Actual - Predicted| / ((|Actual| + |Predicted|) / 2) )
# sMAPE is useful when the actual values contain zeros, which is our case:
# disaster frequency is often 0.
# MAPE with an actual value of 0 produces an infinite error because of
# division by 0; sMAPE is an alternative that avoids this.
# sMAPE is bounded below by 0% and above by 200%.
# calculate sMAPE for each model on both the train set and the test set
# manually shift the actual values to prevent NaN's in sMAPE later:
# whenever the actual and predicted values are both 0, the sMAPE
# denominator is 0, which produced NaN's for all of the knn predictions.
train_actual <- train_sub$Disaster_Frequency %>% as_tibble() %>%
rename(actual = value) %>%
mutate(actual = actual + 0.001)
test_actual <- test_sub$Disaster_Frequency %>% as_tibble() %>%
rename(actual = value) %>%
mutate(actual = actual + 0.001)
# random forest
RF_train_sMAPE <- RF_train_predict %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
RF_test_sMAPE <- RF_test_predict %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
# xg-boost
boost_train_sMAPE <- boost_train_predict %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
boost_test_sMAPE <- boost_test_predict %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
# poisson lasso
poi_train_sMAPE <- poi_fitted %>% predict(train_sub) %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
poi_test_sMAPE <- poi_fitted %>% predict(test_sub) %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
# poisson elastic net
poi_fitted_enp_train_sMAPE <- poi_fitted_enp %>% predict(train_sub) %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
poi_fitted_enp_test_sMAPE <- poi_fitted_enp %>% predict(test_sub) %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
# knn
knn5_train_sMAPE <- knn_5_train_pred %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
knn5_test_sMAPE <- knn_5t %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
knn10_train_sMAPE <- knn_10_train_pred %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
knn10_test_sMAPE <- knn_10t %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
knn20_train_sMAPE <- knn_20_train_pred %>%
bind_cols(train_actual) %>%
smape(truth = actual, estimate = .pred)
knn20_test_sMAPE <- knn_20t %>%
bind_cols(test_actual) %>%
smape(truth = actual, estimate = .pred)
# gather results
model_selection_result <- bind_rows(RF_train_sMAPE, RF_test_sMAPE,
boost_train_sMAPE, boost_test_sMAPE,
poi_train_sMAPE, poi_test_sMAPE,
poi_fitted_enp_train_sMAPE,
poi_fitted_enp_test_sMAPE,
knn5_train_sMAPE, knn5_test_sMAPE,
knn10_train_sMAPE,knn10_test_sMAPE,
knn20_train_sMAPE, knn20_test_sMAPE)
# organize results into one dataframe
model_selection_result <-
data.frame(prediction = c("RF_train", "RF_test",
"Boost_train", "Boost_test",
"Poisson_lasso_train", "Poisson_lasso_test",
"Poisson_elastic_net_train", "Poisson_elastic_net_test",
"knn5_train", "knn5_test",
"knn10_train", "knn10_test",
"knn20_train", "knn20_test")) %>%
as_tibble() %>%
bind_cols(model_selection_result)
# display the smape performance for all models
model_selection_result
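To make the sMAPE definition in the comments concrete, here is a hand-rolled version of the formula (a sketch; the table above uses yardstick::smape(), which reports values in percent on the same \(0\%\) to \(200\%\) scale):

```r
# sMAPE: mean of |actual - predicted| over the average magnitude of the
# pair, in percent; a (0, 0) pair yields NaN, hence the 0.001 shift above.
smape_manual <- function(actual, predicted) {
  100 * mean(abs(actual - predicted) /
               ((abs(actual) + abs(predicted)) / 2))
}

smape_manual(c(2, 4, 8), c(1, 4, 10))   # -> approx. 29.63
```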
Validation on held-out data has been performed above for each model. The table below gathers all of those held-out metrics into one place.
# Reformatting knn metric table to combine
knn <- knn_held_metrics %>% rename(.estimate=value,.metric=name) %>%
mutate(.estimator="standard",.after=.metric) %>%
select(.metric,.estimator, .estimate)
knn_val_5 <-knn %>% slice(c(1:3))
knn_val_10 <-knn %>% slice(c(4:6))
knn_val_20 <-knn %>% slice(c(7:9))
# gather validation result for all models.
val<- bind_rows(
RF_train_predict_metric,
boost_train_predict_metrics,
poi_prediction_1,
poi_prediction_2,
knn_val_5,
knn_val_10,
knn_val_20,
.id = "Model") %>%
mutate(Model = dplyr::recode(Model,
"1" = "Random Forest",
"2" = "XG Boost",
"3" = "Poisson Lasso",
"4" = "Poisson Elastic Net",
"5"= "K-Nearest (k=5)",
"6"= "K-Nearest (k=10)",
"7"= "K-Nearest (k=20)")) %>%
pivot_wider(names_from = c(".metric"),
values_from = .estimate)
val
From the metrics on held-out data, it is clear that random forest and XG Boost outperform the other models, yielding lower MAE and RMSE with R squared values of \(0.84\) and \(0.81\), respectively.
One of the main human impacts worsening climate change is our carbon footprint. Developing countries produce the most carbon dioxide from the output of domestically controlled enterprises, and this is correlated with the higher disaster frequency in those countries.
The best model of the five we have chosen is random forest. The model explains roughly \(70\%\) of the variation on the training data, with a mean of squared residuals (mse) of \(8.411\); since the mse is a squared quantity, this means the fitted values deviate from the observed values by about \(\sqrt{8.411} \approx 2.9\) on average. In addition, its error metrics were significantly lower than those of the other models, with an rmse of \(2.17\) and an mae of \(1.05\) on held-out data. From the rsq value of \(0.84\), we can conclude that the model explains approximately \(84\%\) of the variability in disaster frequency.
From our analysis, we can report that disaster frequency is highly correlated with CO\(2\) emissions. Extracting the variable importance from the random forest, the most influential variables are Output_Domestic_Construction, Output_Domestic_Agriculture, and Output_Domestic_Manufacturing. This concurs with the fifth visual produced, "CO\(2\) emissions by Category of Enterprise for Each Country," where Output_Domestic contributes the most CO\(2\) produced. Based on the word cloud, the most prominent sectors are transportation and energy.
A problem we encountered and were not able to fix was running the Poisson models with country as a predictor. To keep our models consistent, we used the same training data for all \(5\) models, which means none of our models includes the variable country.
With that said, a suggestion for future work would be to include countries for a more comprehensive analysis. And just as the hyperparameters were tuned for the random forest, the same could be done for k-nearest neighbors.
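A sketch of that suggestion: neighbors can be marked with tune() and selected by cross-validation, mirroring the random forest tuning. Shown here on built-in data, with mtcars standing in for train_sub:

```r
library(tidymodels)

set.seed(1111)
folds <- vfold_cv(mtcars, v = 5)

# Mark k for tuning instead of fixing it by hand
knn_tune_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("regression") %>%
  set_engine("kknn")

knn_tune_wf <- workflow() %>%
  add_model(knn_tune_spec) %>%
  add_formula(mpg ~ .)

# Evaluate a small grid of k values and pick the best by RMSE
knn_tuned <- tune_grid(
  knn_tune_wf,
  resamples = folds,
  grid = tibble(neighbors = c(3, 5, 10, 15, 20)),
  metrics = metric_set(rmse, rsq, mae)
)
select_best(knn_tuned, metric = "rmse")
```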
International Monetary Fund. 2022. Climate Change Indicators Dashboard. https://climatedata.imf.org/pages/access-data. Accessed 2023-03-15.